In [ ]:
%load_ext lab_black

import warnings

warnings.filterwarnings("ignore")

Introduction¶

In this chapter, we dive into artificial neural networks, one of the main drivers of artificial intelligence.

Neural networks have been around for many decades. Perhaps the first such model was built by Marvin Minsky in 1951. He called his algorithm SNARC ("stochastic neural-analog reinforcement calculator"). Since then, neural networks have gone through several stages of development. One of the milestones was the idea of Paul J. Werbos in 1974 [1] to efficiently calculate gradients in the optimization algorithm by an approach called "backpropagation". Another milestone was the use of GPUs (graphics processing units) to greatly reduce calculation time.

Artificial neural nets are extremely versatile and powerful. They can be used to

  1. fit simple models like GLMs,
  2. learn interactions and non-linear effects in an automatic way (like tree-based methods),
  3. optimize general loss functions,
  4. fit data much larger than RAM (e.g. images),
  5. learn "online" (update the model with additional data),
  6. fit multiple response variables at the same time,
  7. model input of dimension higher than two (e.g. images, videos),
  8. model input of different input dimensions (e.g. text and images),
  9. fit data with sequential structure in both in- and output (e.g. a text translator),
  10. model data with spatial structure (images),
  11. fit models with many millions of parameters,
  12. do non-linear dimension reduction.

In this chapter, we will mainly deal with the first three aspects. Since a lot of new terms are being used, a small glossary can be found in Section "Neural Network Slang".

Understanding Neural Nets¶

To learn how and why neural networks work, we will go through three steps - each illustrated on the diamonds data:

  • Step 1: Linear regression as neural net
  • Step 2: Hidden layers
  • Step 3: Activation functions

After this, we will be ready to build more complex models.

Step 1: Linear regression as neural net¶

Let us revisit the simple linear regression $$ E(\text{price}) = \alpha + \beta \cdot \text{carat} $$ calculated on the full diamonds data. In Chapter 1 we have found the solution $\hat\alpha = -2256.36$ and $\hat \beta = 7756.43$ by ordinary least-squares.

The above situation can be viewed as a neural network with

  • an input layer with two nodes (carat and the intercept called "bias unit" with value 1),
  • a "fully connected" (= "dense") output layer with one node (price). Fully connected means that each node of a layer is a linear function of all node values of the previous layer. Each linear function has parameters or weights to be estimated, in our simple case just $\alpha$ and $\beta$.
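To make the "dense layer" idea concrete, here is a minimal NumPy sketch of the forward calculation of this tiny network. It plugs in the OLS estimates quoted above as weights. The three carat values are made up for illustration only.

```python
import numpy as np

# OLS estimates quoted in the text, used here as network weights
alpha_hat, beta_hat = -2256.36, 7756.43

# Hypothetical carat values, chosen only for illustration
carat = np.array([0.3, 0.7, 1.0])

# Input layer with two nodes: the bias unit (constant 1) and carat
inputs = np.column_stack([np.ones_like(carat), carat])

# Dense output layer with one node: a linear function of all input nodes
weights = np.array([alpha_hat, beta_hat])
price_pred = inputs @ weights

print(price_pred)
```

The matrix product is nothing more than $\hat\alpha + \hat\beta \cdot \text{carat}$, evaluated for each row.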

Visualized as a graph, the situation looks as follows.

Some of the figures were made with this cool webtool.

To gain confidence in neural nets, we first show that parameters estimated by a neural network are quite similar to the ones learned by linear least-squares. To do so, we will use Google's TensorFlow with its convenient (functional) Keras interface.

Example: simple linear regression¶

In [ ]:
# Define model
from plotnine.data import diamonds
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.metrics import RootMeanSquaredError as rmse

# Input layer: we have 1 covariate
inputs = keras.Input(shape=(1,))

# Output layer densely connected to the input layer
outputs = layers.Dense(1)(inputs)

# Create model
model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 dense (Dense)               (None, 1)                 2         
                                                                 
=================================================================
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________
In [ ]:
# Compile model, i.e., specify loss, optimizer, and metrics
model.compile(
    loss="mse",
    optimizer=keras.optimizers.Adam(learning_rate=1),
    metrics=[rmse()],
)
In [ ]:
# Fit model - naive without validation
history = model.fit(
    x=diamonds["carat"], y=diamonds["price"], epochs=30, batch_size=100, verbose=0
)

# Fitted coefficients
print("Fitted coefficients with neural net:", model.get_weights())
Fitted coefficients with neural net: [array([[7722.846]], dtype=float32), array([-2217.7605], dtype=float32)]
In [ ]:
# Compare with linear regression
import statsmodels.formula.api as smf

ols = smf.ols(formula="price ~ carat", data=diamonds).fit()
print("Fitted with OLS:", ols.params, sep="\n")
Fitted with OLS:
Intercept   -2256.360580
carat        7756.425618
dtype: float64
In [ ]:
# Plot training RMSE over epochs
import matplotlib.pyplot as plt

plt.plot(history.history["root_mean_squared_error"])
plt.gca().set(title="Training RMSE over epochs", ylabel="RMSE", xlabel="Epoch")
plt.grid()
In [ ]:
# Effect of carat on average price
import numpy as np

carat = np.linspace(0, 3, 30)
plt.plot(carat, model(carat), marker="o")
plt.gca().set(title="Effect of carat on average price", ylabel="price", xlabel="carat")
plt.grid()

Comment: The solution of the simple neural network is indeed quite similar to the OLS solution.

The optimization algorithm¶

Neural nets are typically fitted by mini-batch gradient descent, using backpropagation to efficiently calculate gradients. It works as follows:

  1. Initialize the parameters with random values.
  2. Forward step: Use the parameters to predict all observations of a batch. A batch is a randomly selected subset of the full data set.
  3. Backpropagation step: Change the parameters in the right direction, making the average loss $Q$ (e.g., the MSE) of the current batch smaller. This involves calculating derivatives ("gradients") of $Q$ with respect to all parameters. Backpropagation does so in a layer-per-layer fashion, making heavy use of the chain rule.
  4. Repeat Steps 2-3 until each observation appeared in a batch. This is called an epoch.
  5. Repeat Step 4 for multiple epochs until the parameter estimates stabilize or validation performance stops improving.

Gradient descent on batches of size 1 is called "stochastic gradient descent" (SGD).
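The five steps can be sketched in plain NumPy for a simple linear regression. This is an illustration under stated assumptions, not what Keras does internally: the data are synthetic, the learning rate and batch size are arbitrary choices, and the gradients are written out by hand instead of being obtained by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 2 + 3 * x + noise
n = 1000
x = rng.uniform(0, 1, n)
y = 2 + 3 * x + rng.normal(0, 0.1, n)

# Step 1: initialize the parameters with random values
alpha, beta = rng.normal(size=2)

learning_rate = 0.1  # arbitrary choice
batch_size = 50  # arbitrary choice
n_epochs = 200

for epoch in range(n_epochs):  # Step 5: repeat for multiple epochs
    shuffled = rng.permutation(n)
    for start in range(0, n, batch_size):  # Step 4: one full pass is one epoch
        batch = shuffled[start : start + batch_size]
        xb, yb = x[batch], y[batch]

        # Step 2: forward step, predict all observations of the batch
        pred = alpha + beta * xb

        # Step 3: gradients of the batch MSE, then a small step downhill
        resid = pred - yb
        grad_alpha = 2 * resid.mean()
        grad_beta = 2 * (resid * xb).mean()
        alpha -= learning_rate * grad_alpha
        beta -= learning_rate * grad_beta

print(alpha, beta)  # should end up close to 2 and 3
```

Setting `batch_size = 1` would turn this loop into stochastic gradient descent.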

Step 2: Hidden layers¶

Our first neural network above consisted of only an input layer and an output layer. By adding one or more hidden layers between in- and output, the network gains additional parameters, and thus more flexibility. The nodes of a hidden layer can be viewed as latent variables, representing the original covariates. The values of the nodes of a hidden layer are sometimes called an encoding. The closer a layer is to the output, the better its nodes are suited to predict the response variable. In this way, a neural network finds the right transformations and interactions of its covariates in an automatic way. The only ingredients are a large data set and a flexible enough network "architecture" (number of layers, nodes per layer).

Neural nets with more than one hidden layer are called "deep neural nets".

We will now add a hidden layer with five nodes $v_1, \dots, v_5$ to our simple linear regression network. The architecture looks as follows:

This network has 16 parameters. How much better than our simple network with just two parameters will it be?

Example: hidden layer¶

The following code is almost identical to the previous one, except that there is a hidden layer between input and output layer.

In [ ]:
# Define and fit model
from plotnine.data import diamonds
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.metrics import RootMeanSquaredError as rmse

# Input layer: we have 1 covariate
inputs = keras.Input(shape=(1,))

# One hidden layer with 5 nodes
hidden = layers.Dense(5)(inputs)  # new line of code!

# Output layer now connected to the hidden layer
outputs = layers.Dense(1)(hidden)  # modified

# Create model
model = keras.Model(inputs=inputs, outputs=outputs)
# model.summary()

# Compile model
model.compile(
    loss="mse",
    optimizer=keras.optimizers.Adam(learning_rate=1),
    metrics=[rmse()],
)

# Fit model - naive without validation
model.fit(
    x=diamonds["carat"], y=diamonds["price"], epochs=30, batch_size=100, verbose=0
)
Out[ ]:
<keras.callbacks.History at 0x2b98651e2e0>
In [ ]:
# Plot effect
import numpy as np
import matplotlib.pyplot as plt

carat = np.linspace(0, 3, 30)
plt.plot(carat, model(carat), marker="o")
plt.gca().set(title="Effect of carat on average price", ylabel="price", xlabel="carat")
plt.grid()

Comment: Oops, it seems as if the extra hidden layer had no effect. The reason is that a linear function of a linear function is still a linear function. Adding the hidden layer did not really change the capabilities of the model. It just added a lot of unnecessary parameters.
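The argument "a linear function of a linear function is still a linear function" can be checked numerically. The sketch below uses random numbers as stand-ins for the fitted weights of a 1-5-1 network without activation functions and collapses the two layers into a single slope and intercept.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical weights of a 1 -> 5 -> 1 network without activation functions
W1, b1 = rng.normal(size=(1, 5)), rng.normal(size=5)  # hidden layer
W2, b2 = rng.normal(size=(5, 1)), rng.normal(size=1)  # output layer

carat = np.linspace(0, 3, 30).reshape(-1, 1)

# Forward pass through both layers
two_layer = (carat @ W1 + b1) @ W2 + b2

# The same function written as a single straight line
slope = W1 @ W2  # shape (1, 1)
intercept = b1 @ W2 + b2  # shape (1,)
one_layer = carat @ slope + intercept

print(np.allclose(two_layer, one_layer))  # True
```

All 16 parameters of the larger network thus boil down to the same two numbers as in the simple linear regression.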

Step 3: Activation functions¶

The missing magic component is the so-called activation function $\sigma$ after each layer, which transforms the values of the nodes. So far, we have implicitly used "linear activations", which - in neural network slang - is just the identity function.

Applying non-linear activation functions after hidden layers serves to introduce non-linear and interaction effects. Typical such functions are

  • the hyperbolic tangent $\sigma(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ ("S"-shaped function that maps real values to $(-1, 1)$),
  • the standard logistic function ("sigmoid") $\sigma(x) = 1 / (1 + e^{-x})$ ("S"-shaped function that maps real values to $(0, 1)$, shifted and scaled hyperbolic tangent),
  • the rectified linear unit "ReLU" $\sigma(x) = \text{max}(0, x)$ that sets negative values to 0.
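The three functions are easy to write down in NumPy. The sketch below also checks the remark that the logistic function is a shifted and scaled hyperbolic tangent, using the identity $1 / (1 + e^{-x}) = (1 + \tanh(x/2)) / 2$. The function names are chosen here for illustration.

```python
import numpy as np


def tanh(x):
    # Hyperbolic tangent, values strictly between -1 and 1
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))


def sigmoid(x):
    # Standard logistic function, values strictly between 0 and 1
    return 1 / (1 + np.exp(-x))


def relu(x):
    # Rectified linear unit: negative values are set to 0
    return np.maximum(0, x)


x = np.linspace(-5, 5, 11)

# The logistic function is a shifted and scaled hyperbolic tangent
print(np.allclose(sigmoid(x), (1 + tanh(x / 2)) / 2))  # True
```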

Activation functions applied to the output layer have a different purpose, namely the same as the inverse of the link function of a corresponding GLM. Such an activation maps predictions to the scale of the response:

  • identity/"linear" activation $\rightarrow$ usual regression
  • logistic activation $\rightarrow$ binary logistic regression (one probability)
  • softmax activation $\rightarrow$ multinomial logistic regression (one probability per class)
  • exponential activation $\rightarrow$ log-linear regression as with Poisson or Gamma regression
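As a small illustration of output activations acting like inverse link functions, the following sketch applies the softmax and the exponential function to a made-up row of values on the "linear" scale. The input numbers are arbitrary.

```python
import numpy as np


def softmax(z):
    # Subtracting the row maximum avoids overflow without changing the result
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Hypothetical values of three output nodes before activation
linear_scale = np.array([[-1.0, 0.0, 2.0]])

probs = softmax(linear_scale)  # one probability per class, summing to 1
positive = np.exp(linear_scale)  # strictly positive, as needed for a log link

print(probs.sum(axis=1))
print(positive)
```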

Let us add a hyperbolic tangent activation function ($\sigma$) after the hidden layer of our simple example.

Example: activation functions¶

Again, the code is very similar to the last one, with the exception of using a hyperbolic tangent activation after the hidden layer (and different learning rate and number of epochs).

In [ ]:
# Define and fit model
from plotnine.data import diamonds
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.metrics import RootMeanSquaredError as rmse

# Input layer: we have 1 covariate
inputs = keras.Input(shape=(1,))

# One hidden layer with 5 nodes
hidden = layers.Dense(5, activation="tanh")(inputs)

# Output layer now connected to the hidden layer
outputs = layers.Dense(1, activation="linear")(hidden)

# Create model
model = keras.Model(inputs=inputs, outputs=outputs)
# model.summary()

# Compile model
model.compile(
    loss="mse",
    optimizer=keras.optimizers.Adam(learning_rate=0.2),
    metrics=[rmse()],
)

# Fit model - naive without validation
model.fit(
    x=diamonds["carat"], y=diamonds["price"], epochs=50, batch_size=100, verbose=0
)
Out[ ]:
<keras.callbacks.History at 0x2b986672e20>
In [ ]:
# Plot effect
import numpy as np
import matplotlib.pyplot as plt

carat = np.linspace(0, 3, 30)
plt.plot(carat, model(carat), marker="o")
plt.gca().set(title="Effect of carat on average price", ylabel="price", xlabel="carat")
plt.grid()

Comment: Adding the non-linear activation after the hidden layer has changed the model. The association between carat and price is now represented by a non-linear function.

Practical Considerations¶

Validation and tuning of main parameters¶

So far, we have naively fitted the neural networks without splitting the data for test and validation. Don't do this! Usually, one sets a small test dataset (e.g. 10% of rows) aside to assess the final model performance and uses simple (or cross-)validation for model tuning.

In order to choose the main tuning parameters, namely

  • network architecture,
  • activation functions,
  • learning rate,
  • batch size, and
  • number of epochs,

one often uses simple validation because cross-validation takes too much time.

Missing values¶

A neural net does not accept missing values in the input. They need to be filled, e.g., by a typical value or a value below the minimum.
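A minimal sketch of the "typical value" strategy, assuming a single numeric input stored as a NumPy array with made-up values. The fill value is calculated on the training data only and then reused for the test data.

```python
import numpy as np

# Hypothetical input with missing values
x_train = np.array([0.5, np.nan, 1.2, 0.9, np.nan])
x_test = np.array([np.nan, 0.7])

# Typical value: the median of the non-missing training values
fill_value = np.nanmedian(x_train)

# Alternative mentioned in the text: a value below the minimum, e.g.
# fill_value = np.nanmin(x_train) - 1

x_train_filled = np.where(np.isnan(x_train), fill_value, x_train)
x_test_filled = np.where(np.isnan(x_test), fill_value, x_test)

print(x_train_filled, x_test_filled)
```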

Input standardization¶

Gradient descent starts by random initialization of parameters. This step is optimized for standardized input. Standardization has to be done manually, by either

  • min/max scaling the values of each input to the range -1 to 1,
  • standard scaling the values of each input to mean 0 and standard deviation 1, or
  • using relative ranks.

Note that the scaling transformation is calculated on the training data and then applied to the validation and test data. This usually requires a couple of lines of code.
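The "couple of lines" could look as follows for standard scaling. This is a NumPy sketch on simulated data, meant to show the train/test logic. In practice, a tool like scikit-learn's `StandardScaler` does the same job.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated inputs on very different scales (illustration only)
X_train = rng.normal(loc=[10, 200], scale=[2, 50], size=(500, 2))
X_test = rng.normal(loc=[10, 200], scale=[2, 50], size=(100, 2))

# The transformation is calculated on the training data ...
mean = X_train.mean(axis=0)
sd = X_train.std(axis=0)

# ... and then applied to training and test data alike
X_train_std = (X_train - mean) / sd
X_test_std = (X_test - mean) / sd

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

The standardized test data will have means and standard deviations only approximately equal to 0 and 1, which is expected.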

Categorical input¶

There are three common ways to represent categorical input variables in a neural network.

  1. Binary and ordinal categoricals are best represented by integers and then treated as numeric.
  2. Unordered categoricals are either one-hot-encoded (i.e., each category is represented by a binary variable) or
  3. they are represented by a (categorical) embedding. To do so, the categories are integer encoded and then condensed by a special embedding layer to a few (usually 1 or 2) dense features. This requires a more complex network architecture but saves memory and preprocessing. This approach is heavily used when the input consists of words (which is a categorical variable with thousands of levels - one level per word).

For Option 2, input standardization is not required. For Option 3, it must not be applied because the embedding layer expects integers.
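The relationship between Options 2 and 3 can be sketched in NumPy: an embedding is a lookup table with one row per category, which gives the same result as multiplying the one-hot-encoded input by that table. The category levels and the random matrix `E` below are illustrative stand-ins. In a real network, the entries of `E` are learned weights.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative category levels (sorted) and some observed values
categories = np.array(["HBACK", "SEDAN", "UTE"])
values = np.array(["UTE", "HBACK", "UTE", "SEDAN"])

# Integer encoding: position of each value among the sorted levels
codes = np.searchsorted(categories, values)

# Option 2: one-hot encoding, one binary column per category
one_hot = np.eye(len(categories))[codes]

# Option 3: embedding with two dense features per category
# (random numbers standing in for learned weights)
E = rng.normal(size=(len(categories), 2))
embedded = E[codes]  # simple row lookup

print(codes)  # [2 0 2 1]
print(np.allclose(embedded, one_hot @ E))  # True
```

The lookup avoids materializing the wide one-hot matrix, which is where the memory saving comes from.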

Callbacks¶

Sometimes, we want to take actions during training, such as

  • stop training when validation performance starts worsening,
  • reduce the learning rate when the optimization is stuck in a "plateau", or
  • save the network weights between epochs.

Such monitoring tasks are called callbacks. We will see them in the example below.
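To illustrate the logic behind the first action, here is a simplified pure-Python sketch of early stopping with a "patience" parameter. The function name and the validation losses are made up, and the real Keras callback offers more options than shown here.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the (0-based) epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # number of epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses per epoch
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69]

print(early_stopping_epoch(val_losses, patience=3))  # 5
print(early_stopping_epoch(val_losses, patience=5))  # None
```

A larger patience tolerates longer stretches without improvement, at the price of more training time.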

Types of layers¶

So far, we have encountered only dense (= fully connected) layers and activation layers. Here are some further types:

  • Embedding layers to represent integer encoded categoricals.
  • Dropout layers to add regularization.
  • Convolutional and pooling layers for image data.
  • Recurrent layers (long-short-term memory LSTM, gated recurrent unit GRU) for sequence data.
  • Concatenation layers to combine different branches of the network (like in a directed graph).
  • Flatten layers to bring higher dimensional layers to dimension 1 (relevant, e.g., for embeddings, image and text data).

Optimizer¶

Pure gradient descent is rarely applied without tweaks because it tends to get stuck in local minima, especially for complex networks with non-convex objective surfaces. Modern variants are "adam", "nadam" and "RMSProp". These optimizers usually work out-of-the-box, except for the learning rate, which has to be chosen manually.

Custom losses and evaluation metrics¶

Frameworks like Keras/TensorFlow offer many predefined loss functions and evaluation metrics. Choosing them is a crucial step, just as with tree boosting. Using TensorFlow's backend functions, one can define one's own metrics and loss functions (see exercises).

Overfitting and regularization¶

As with linear models, a model with too many parameters will overfit in an undesired way. With about 50 to 100 observations per parameter, overfitting is usually unproblematic. (For image and text data, different rules apply.) Besides using fewer parameters, the main options to reduce overfitting are the following:

  • pull the parameters of a layer slightly towards zero by applying L1 and/or L2 penalties to the parameters,
  • add dropout layers. A dropout layer randomly sets some of the node values of the previous layer to 0, switching them off. This is an elegant way to fight overfitting and is related to bagging.
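What a dropout layer does during training can be sketched in a few lines of NumPy. The sketch assumes the common "inverted dropout" variant, in which the surviving node values are scaled up by `1 / (1 - rate)` so that their expected value stays unchanged. The function name and the dropout rate are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)


def dropout(values, rate, rng):
    # Each node value is kept with probability 1 - rate, else set to 0
    keep = rng.random(values.shape) >= rate
    # Scaling keeps the expected value of each node unchanged
    return values * keep / (1 - rate)


# Hypothetical values of a hidden layer with 30 nodes for 1000 observations
hidden = np.ones((1000, 30))
dropped = dropout(hidden, rate=0.2, rng=rng)

print((dropped == 0).mean())  # share of switched-off values, roughly 0.2
print(dropped.mean())  # roughly 1, as before dropout
```

At prediction time, no values are dropped.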

Choosing the architecture¶

How many layers and how many nodes per layer should we select? For tabular data, using 1-3 hidden layers is usually enough. If we start with $m$ input variables, the number of nodes in the first hidden layer is usually higher than $m$ and reduces for later layers. There should not be a "representational bottleneck", i.e., an early hidden layer with too few parameters.

The number of parameters should not be too high compared to the number of rows, see "Overfitting and regularization" above.
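To compare the number of parameters with the number of rows, it helps to count parameters before fitting. For a network of dense layers only, each node has one weight per node of the previous layer plus one bias. The helper below (its name is made up) sketches this calculation.

```python
def n_dense_params(layer_sizes):
    """Number of parameters of a network with dense layers only.

    layer_sizes: nodes per layer, from input to output (bias units excluded).
    """
    return sum(
        (n_in + 1) * n_out  # weights plus one bias per node
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


print(n_dense_params([1, 1]))  # 2: simple linear regression
print(n_dense_params([1, 5, 1]))  # 16: one hidden layer with five nodes
print(n_dense_params([4, 30, 15, 1]))  # 631: four inputs, hidden layers of 30 and 15 nodes
```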

Interpretation¶

Variable importance of covariates in neural networks can be assessed by permutation importance (how much performance is lost when shuffling column X?) or SHAP importance. Covariate effects can be investigated, e.g., by partial dependence plots or SHAP dependence plots.
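The idea of permutation importance fits in a few lines. The sketch below uses simulated data and a hand-written `predict()` function as a stand-in for a fitted neural net. Both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated data: only the first covariate affects the response
n = 2000
X = rng.normal(size=(n, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=n)


def predict(X):
    # Stand-in for the prediction function of a fitted model
    return 3 * X[:, 0]


def rmse(y, pred):
    return np.sqrt(np.mean((y - pred) ** 2))


baseline = rmse(y, predict(X))

# How much performance is lost when shuffling column j?
importance = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importance.append(rmse(y, predict(X_perm)) - baseline)

print(importance)  # large for column 0, zero for column 1
```

Since the model is only called for predictions, the same recipe works for any model type.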

Example: diamonds¶

We will now fit a neural net with two hidden layers (30 and 15 nodes) and a total of 631 parameters to model diamond prices. Learning rate, activation functions, and batch size were manually chosen by simple validation. The number of epochs is chosen automatically by an early stopping callback.

In [ ]:
# Preprocessing
from plotnine.data import diamonds
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Train/test split
df_train, df_test, y_train, y_test = train_test_split(
    diamonds, diamonds["price"], test_size=0.2, random_state=341
)

# Data preprocessing pipeline
ord_features = ["cut", "color", "clarity"]
ord_levels = [diamonds[x].cat.categories.to_list() for x in ord_features]

preprocessor = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("ordinal", OrdinalEncoder(categories=ord_levels), ord_features),
            ("numeric", "passthrough", ["carat"]),
        ]
    ),
    StandardScaler(),
)

X_train = preprocessor.fit_transform(df_train)
X_test = preprocessor.transform(df_test)
X_test[0:2]  # Check
Out[ ]:
array([[ 0.98143089, -1.5256479 ,  0.57680845, -0.60749424],
       [-0.80744346, -1.5256479 , -1.24422106,  1.01420318]])
In [ ]:
# Define model
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.metrics import RootMeanSquaredError as rmse

# Input layer: we have 4 covariates
inputs = keras.Input(shape=(4,))

# Two hidden layers with contracting number of nodes
x = layers.Dense(30, activation="relu")(inputs)
x = layers.Dense(15, activation="relu")(x)

# Output layer now connected to the last hidden layer
outputs = layers.Dense(1, activation="linear")(x)

# Create model
model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()
Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_4 (InputLayer)        [(None, 4)]               0         
                                                                 
 dense_5 (Dense)             (None, 30)                150       
                                                                 
 dense_6 (Dense)             (None, 15)                465       
                                                                 
 dense_7 (Dense)             (None, 1)                 16        
                                                                 
=================================================================
Total params: 631
Trainable params: 631
Non-trainable params: 0
_________________________________________________________________
In [ ]:
# Compile and fit model
model.compile(
    loss="mse",
    optimizer=keras.optimizers.Adam(learning_rate=0.3),
    metrics=[rmse()],
)

# Callbacks
cb = [
    keras.callbacks.EarlyStopping(patience=20),
    keras.callbacks.ReduceLROnPlateau(patience=5),
]

# Fit model
tf.random.set_seed(498)

history = model.fit(
    x=X_train,
    y=y_train,
    epochs=1000,
    batch_size=400,
    validation_split=0.2,
    callbacks=cb,
    verbose=0,
)
In [ ]:
# Plot RMSE over epochs
import matplotlib.pyplot as plt

plt.plot(history.history["root_mean_squared_error"], label="Training")
plt.plot(history.history["val_root_mean_squared_error"], label="Validation")
plt.legend()
plt.gca().set(title="RMSE over epochs", xlabel="Epoch", ylabel="RMSE")
plt.grid()
In [ ]:
# Interpretation
import dalex as dx
import plotly

plotly.offline.init_notebook_mode()  # for saving html with plotly plots

# Set up explainer
def pred_fun(m, X):
    return m.predict(preprocessor.transform(X), batch_size=1000, verbose=0).flatten()


exp = dx.Explainer(
    model,
    data=df_test[ord_features + ["carat"]],
    y=y_test,
    predict_function=pred_fun,
    verbose=False,
)

# Performance on test data
mp = exp.model_performance(model_type="regression")
print("Performance\n", mp.result)
Performance
                       mse      rmse        r2         mae         mad
Functional  341837.605583  584.6688  0.978364  321.518386  146.676117
In [ ]:
# Permutation importance
vi = exp.model_parts()
vi.plot()
In [ ]:
# Partial dependence
pdp_num = exp.model_profile(
    variables=["carat"], label="Partial dependence for numeric variables", verbose=False
)
pdp_num.plot()

pdp_ord = exp.model_profile(
    variable_type="categorical",
    variable_splits=dict(zip(ord_features, ord_levels)),
    label="Partial dependence for ordinal variables",
    verbose=False,
)
pdp_ord.plot(facet_scales="free")

Comment: Performance is lower than that of the tree-based models. This might partly be a consequence of the effects being smoother, but also of the model not having been refitted on the full training data, for simplicity (20% of the training rows are used for validation).

Example: Embeddings¶

Representing categorical input variables through embedding layers is extremely useful in practice. We will end this chapter with an example on how to do it with the claims data. This example also shows how flexible neural network structures are.

In [ ]:
# Load and inspect data
import pandas as pd

car = pd.read_csv("car.csv")  # see readme how to get the data
car.head()
Out[ ]:
   veh_value  exposure  clm  numclaims  claimcst0  veh_body  veh_age  gender  area  agecat     _OBSTAT_
0       1.06  0.303901    0          0        0.0     HBACK        3       F     C       2  01101 0 0 0
1       1.03  0.648871    0          0        0.0     HBACK        2       F     A       4  01101 0 0 0
2       3.26  0.569473    0          0        0.0       UTE        2       F     E       2  01101 0 0 0
3       4.14  0.317591    0          0        0.0     STNWG        2       F     D       2  01101 0 0 0
4       0.72  0.648871    0          0        0.0     HBACK        4       F     C       2  01101 0 0 0
In [ ]:
# Preprocessing
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Train/test split
df_train, df_test, y_train, y_test = train_test_split(
    car, car["clm"], test_size=0.2, stratify=car["clm"], random_state=3341
)

num_features = ["veh_value", "veh_age", "agecat"]
ord_features = ["gender", "area"]
ord_levels = [sorted(car[v].unique()) for v in ord_features]
dense_features = ord_features + num_features

# Preprocess dense features, i.e., everything except "veh_body"
prepare_dense = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("ordinal", OrdinalEncoder(categories=ord_levels), ord_features),
            ("numeric", "passthrough", num_features),
        ]
    ),
    StandardScaler(),
)

# Preprocess "veh_body"
prepare_embedding = ColumnTransformer(
    transformers=[("embedding", OrdinalEncoder(), ["veh_body"])]
)

prepare_dense.fit(df_train)
prepare_embedding.fit(df_train)

# Function that turns a DataFrame into a dict of input components for the net
def dict_provider(df):
    return {
        "dense1": prepare_dense.transform(df),
        "veh_body": prepare_embedding.transform(df),
    }


dict_provider(df_test.head())  # Check
Out[ ]:
{'dense1': array([[ 1.14978787, -1.23978753, -0.54677517,  1.2453385 ,  1.06238257],
        [-0.86972565,  0.85626104, -0.40579865,  0.30841342, -1.74556725],
        [-0.86972565, -1.23978753, -0.30628581, -0.62851166,  1.76437003],
        [-0.86972565,  0.15757819, -0.09067466, -1.56543674, -1.0435798 ],
        [ 1.14978787,  0.85626104,  1.65079999, -1.56543674,  1.76437003]]),
 'veh_body': array([[10.],
        [ 9.],
        [ 3.],
        [ 3.],
        [12.]])}
In [ ]:
# Define model
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Inputs
input_dense = keras.Input(shape=(len(dense_features),), name="dense1")
input_veh_body = keras.Input(shape=(1,), name="veh_body")

# Embedding of veh_body
m_bodies = df_train["veh_body"].nunique()
emb = layers.Embedding(input_dim=m_bodies, output_dim=1)(input_veh_body)
emb = layers.Flatten()(emb)

# Combine dense input and embedding and connect to output
x = layers.Concatenate()([input_dense, emb])
x = layers.Dense(30, activation="tanh")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

# Input
inputs = {"dense1": input_dense, "veh_body": input_veh_body}

# Create model
model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()
Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 veh_body (InputLayer)          [(None, 1)]          0           []                               
                                                                                                  
 embedding (Embedding)          (None, 1, 1)         13          ['veh_body[0][0]']               
                                                                                                  
 dense1 (InputLayer)            [(None, 5)]          0           []                               
                                                                                                  
 flatten (Flatten)              (None, 1)            0           ['embedding[0][0]']              
                                                                                                  
 concatenate (Concatenate)      (None, 6)            0           ['dense1[0][0]',                 
                                                                  'flatten[0][0]']                
                                                                                                  
 dense_8 (Dense)                (None, 30)           210         ['concatenate[0][0]']            
                                                                                                  
 dense_9 (Dense)                (None, 1)            31          ['dense_8[0][0]']                
                                                                                                  
==================================================================================================
Total params: 254
Trainable params: 254
Non-trainable params: 0
__________________________________________________________________________________________________
In [ ]:
# Compile and fit model
model.compile(
    loss="binary_crossentropy", optimizer=keras.optimizers.Adam(learning_rate=0.0001)
)

# Callbacks
cb = [
    keras.callbacks.EarlyStopping(patience=20),
    keras.callbacks.ReduceLROnPlateau(patience=5),
]

# Fit
tf.random.set_seed(443)

history = model.fit(
    x=dict_provider(df_train),
    y=y_train,
    epochs=1000,
    batch_size=400,
    validation_split=0.2,
    callbacks=cb,
    verbose=0,
)
In [ ]:
# Plot average log loss over epochs
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="Training")
plt.plot(history.history["val_loss"], label="Validation")
plt.legend()
plt.gca().set(title="Average loss over epochs", xlabel="Epoch", ylabel="Loss")
plt.grid()
In [ ]:
# Interpretation
import numpy as np
import dalex as dx
import plotly
from sklearn.metrics import log_loss

plotly.offline.init_notebook_mode()  # for saving html with plotly plots

# Set up explainer
def pred_fun(m, X):
    return m.predict(dict_provider(X), batch_size=1000, verbose=0).flatten()


exp = dx.Explainer(
    model,
    data=df_test[dense_features + ["veh_body"]],
    y=y_test,
    predict_function=pred_fun,
    verbose=False,
)

# Performance on test data
test_loss = log_loss(y_test, exp.predict(df_test))
test_loss0 = log_loss(y_test, np.repeat(y_train.mean(), len(y_test)))
rel_imp = (test_loss0 - test_loss) / test_loss0

print(f"Average test log loss:                    {test_loss: .3f}")
print(f"Relative improvement in average log loss: {rel_imp: .3%}")
Average test log loss:                     0.249
Relative improvement in average log loss:  0.041%
In [ ]:
# Permutation importance on test data
vi = exp.model_parts()
vi.plot()
In [ ]:
# Partial dependence
num_eval_at = {
    "veh_value": np.linspace(0, 5, 41),
    "agecat": sorted(car["agecat"].unique()),
    "veh_age": sorted(car["veh_age"].unique()),
}
pdp_num = exp.model_profile(
    variable_splits=num_eval_at,
    label="Partial dependence for numeric variables",
    verbose=False,
)
pdp_num.plot(facet_scales="free")

cat_eval_at = dict(zip(ord_features, ord_levels))
cat_eval_at["veh_body"] = sorted(car["veh_body"].unique())
pdp_cat = exp.model_profile(
    variable_type="categorical",
    variable_splits=cat_eval_at,
    label="Partial dependence for ordinal variables",
    verbose=False,
)
pdp_cat.plot(facet_scales="free")

Exercises¶

  1. Fit diamond prices by minimizing Gamma deviance with log-link (-> exponential output activation), using the custom loss function defined below. Tune the model by simple validation and evaluate it on a test dataset. Interpret the final model. Hints: I used a smaller learning rate and had to replace the "relu" activations by "tanh". Furthermore, the response needed to be transformed from int to float.
from tensorflow.keras import backend as K

def loss_gamma(y_true, y_pred):
  return -K.log(y_true / y_pred) + y_true / y_pred
  2. Study either the optional claims data example or build your own neural net, predicting claim yes/no. For simplicity, you can represent the categorical feature veh_body by integers.

Neural Network Slang¶

Here, we summarize some of the neural network slang.

  • Activation function: The transformation applied to the node values.
  • Architecture: The layout of layers and nodes.
  • Backpropagation: An efficient way to calculate gradients.
  • Batch: A couple of data rows used for one mini-batch gradient descent step.
  • Callback: An action during training (save weights, reduce learning rate, stop training, ...).
  • Epoch: The process of updating the network weights by gradient descent until each observation in the training set was used once.
  • Embedding: A numeric representation of categorical input as learned by the neural net.
  • Encoding: The values of latent variables of a hidden layer, usually the last.
  • Gradient descent: The basic optimization algorithm of neural networks.
  • Keras: User-friendly wrapper of TensorFlow.
  • Layer: Main organizational unit of a neural network.
  • Learning rate: Controls the step size of gradient descent, i.e., how aggressively the network learns.
  • Node: Nodes on the input layer are the covariates, nodes on the output layer the response(s) and nodes on a hidden layer are latent variables representing the covariates for the task to predict the response.
  • Optimizer: The specific variant of gradient descent.
  • PyTorch: An important implementation of neural networks.
  • Stochastic gradient descent (SGD): Mini-batch gradient descent with batches of size 1.
  • TensorFlow: An important implementation of neural networks.
  • Weights: The parameters of a neural net.

Chapter Summary¶

In this chapter, we have glimpsed into the world of neural networks. Step by step we have learned how a neural network works. We have used Keras and TensorFlow to build models brick by brick.

Closing Remarks¶

During this lecture, we have met many ML algorithms and principles. To get used to them, the best approach is practicing. Kaggle is a great place to do so and learn from the best.

A summary and comparison of the algorithms can be found on github. Here is a screenshot as of Sept. 7, 2020:

As a final task and motivation, try out "Michael's analysis scheme X". It can be applied very frequently and works as follows:

  1. Take a property $T(Y)$ of key interest, e.g., the churn rate, a claims frequency, or a loss ratio. Calculate its estimate on the full dataset.
  2. Do a descriptive analysis of $T(Y \mid X_j)$ for a couple of important covariates $X_1, \dots, X_p$ in order to study the association between $Y$ and each of the $X_j$ separately.
  3. Complement Step 2 by running a well-built ML model $T(Y \mid X_1, \dots, X_p) = f(X_1, \dots, X_p)$, using a clean validation strategy.
    • Study model performance.
    • Study variable importance and use it to sort the results of Step 2. Which of the associations are very strong/weak?
    • For each $X_j$, study its partial dependence (or SHAP dependence) plot. They will accompany the bivariate view of Step 2 with a multivariate one, taking into account the effects from the other covariates.

What additional insights did you get from Step 3?

Chapter References¶

[1] P.J. Werbos, "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences", Dissertation, 1974.